
Introduction

Linear regression is widely regarded as a backbone of modern statistics. Even as more complex, “black box”-style machine learning techniques grow in popularity, many statisticians and researchers still fall back on regression for its interpretability and simplicity. However, linear regression relies on a number of assumptions that may not always hold in practice, such as a constant, linear relationship between the predictor variables and the response. In this guide, we explore the use of splines to model predictor variables whose relationship with the response changes across their domain. These techniques help us approach the predictive power of more advanced machine learning algorithms while keeping the benefits of regression. We show examples in three popular statistical modelling languages: Python, R, and Stata.

Data

In this guide, we will be using the Wage dataset from the R package ISLR, which also accompanies the book An Introduction to Statistical Learning. The dataset contains wages for 3,000 male workers in the Mid-Atlantic region between 2003 and 2009, along with a select number of other personal demographics. We retain the variables wage, age, year, and education for our analysis. Our goal is to examine the relationship between age, year, and education and workers’ yearly wage.

Below are a table and plots describing the variables used in this example.

Data Structure

Variable    Role       Type         Explanation
Wage        Response   Numerical    Worker’s wage
Age         Predictor  Numerical    Age of the worker
Year        Predictor  Numerical    Year when the data was collected
Education   Predictor  Categorical  The education level of the worker

Table 0.1: Role, type, and explanation for each variable.

Plots

Figure 0.2: Histograms and bar charts for the variables used in the analysis

Method

We will first fit a simple linear regression as a baseline. We will then implement four spline-like techniques on the age predictor: polynomial regression, a step function, a basis (regression) spline, and a natural spline.

At each step, we will check for fit quality, noting any potential improvements along the way. We will conclude with a retrospective and summary of what we learned.

Polynomial Regression

Polynomial regression is a simple way to move beyond the linearity of linear regression. It extends the linear model by adding extra terms for a particular predictor, obtained by raising that predictor to successive powers.

This tutorial extends the model by adding polynomial terms of Age. Below is the model used in this tutorial.

\[ Wage_i = \beta_{0} + \beta_{1}(Year_i) + \beta_{2}(Education_i) + \beta_{3}Age_{i} + \beta_{4}Age_{i}^{2} + \beta_{5}Age_{i}^{3} + \epsilon_i \]

An Introduction to Statistical Learning: with Applications in R offers a rule of thumb for how many polynomial terms to use:

“Generally speaking, it is unusual to use d (the number of polynomial terms) greater than 3 or 4 because for large values of d, the polynomial curve can become overly flexible and can take on some very strange shapes.”
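As a quick illustration of the idea (separate from the tutorial's analysis), the sketch below fits a cubic polynomial by ordinary least squares on made-up data, using a design matrix whose columns are successive powers of the predictor. All data values here are synthetic stand-ins, not the Wage data.

```python
import numpy as np

# Toy data standing in for (Age, Wage): a noisy cubic relationship
rng = np.random.default_rng(0)
x = rng.uniform(18, 80, 500)
y = 50 + 2.0 * x - 0.02 * x**2 + 1e-4 * x**3 + rng.normal(0, 5, 500)

# Design matrix with columns x^0, x^1, x^2, x^3 (d = 3 polynomial terms)
X = np.column_stack([x**d for d in range(4)])

# Ordinary least squares fit of all four coefficients at once
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 of the fitted polynomial
y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
```

Note that the same linear-model machinery does all the work; only the design matrix changes.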

Step Function

The main idea of this method is to cut the range of a variable into K groups and fit a different constant to each group. With an intercept in the model, fitting a K-group step function requires only K − 1 additional coefficients.

However, there is no standard rule for where to place the knots or how many to include. If natural breakpoints exist in the data, the number and location of the knots should follow them.

In this example, the step function is applied to the Age variable, which is cut into six bins. Below is the regression model when using the step function on Age.

\[ Wage_i = \beta_{0} + \beta_{1}(Year_i) + \beta_{2}(Education_i) + \beta_{3}\space I_{(28.3, 38.7]} + \beta_{4}\space I_{(38.7,49]} + \beta_{5}\space I_{(49,59.3]} + \beta_{6}\space I_{(59.3,69.7]} + \beta_{7}\space I_{(69.7,80.1]} + \epsilon_i \]

where \(I_{(a,b]}\) is an indicator function that equals 1 if the observation’s Age satisfies \(a < Age \leq b\) and 0 otherwise.
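The indicator construction can be sketched in Python with pandas; the bin edges below mirror the cutpoints in the equation, while the six toy ages are made up for illustration:

```python
import pandas as pd

# Hypothetical ages; the tutorial applies the same idea to the Wage data
age = pd.Series([20, 30, 41, 55, 63, 75])

# Bin edges matching the cutpoints in the model above
bins = [18, 28.3, 38.7, 49, 59.3, 69.7, 80.1]
age_bin = pd.cut(age, bins=bins)  # right-closed intervals (a, b]

# drop_first=True keeps K-1 indicator columns; the dropped first bin
# is absorbed into the model intercept
indicators = pd.get_dummies(age_bin, drop_first=True).astype(int)
```

Each row then has at most one 1, marking which bin that observation falls into (a row of all zeros means the first, dropped bin).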

Regression Spline

One major problem with polynomial regression is that it requires a high polynomial degree to make the model more flexible (i.e., to fit the data better), which can lead to overfitting. One way to avoid this is to use a regression spline.

A regression spline uses a lower-degree polynomial but provides as much flexibility as polynomial regression. The key to this technique is the number of knots: the polynomial degree is kept fixed at a low value while the number of knots is varied.

There are several ways to decide where to place the knots. One is to place many knots where the predictor fluctuates a lot and fewer where it changes little. Another is to place them uniformly.

To choose the number of knots (equivalently, the degrees of freedom), we can use K-fold cross-validation.
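One possible way to carry out this selection in Python uses patsy and scikit-learn. The data below are made up, and the candidate df range and the helper name cv_mse are illustrative choices, not part of the tutorial's analysis:

```python
import numpy as np
from patsy import dmatrix
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Toy stand-in for the Wage data
rng = np.random.default_rng(506)
x = rng.uniform(18, 80, 300)
y = 100 + 20 * np.sin((x - 18) / 20) + rng.normal(0, 10, 300)

def cv_mse(spline_df, k=5):
    """Average held-out MSE of a cubic basis spline with spline_df df."""
    # Design matrix includes patsy's Intercept column, so the linear
    # model below is fit without its own intercept
    X = np.asarray(dmatrix("bs(x, df=spline_df)",
                           {"x": x, "spline_df": spline_df}))
    mses = []
    for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        fit = LinearRegression(fit_intercept=False).fit(X[train], y[train])
        resid = y[test] - fit.predict(X[test])
        mses.append(np.mean(resid**2))
    return np.mean(mses)

# Score a small grid of candidate degrees of freedom and pick the best
scores = {d: cv_mse(d) for d in range(4, 9)}
best_df = min(scores, key=scores.get)
```

The df with the smallest cross-validated MSE is then used to fit the final spline on the full data.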

The difference between a basis spline and a natural spline is that a natural spline is constrained to be linear beyond its boundary knots.

In this tutorial, we will use a cubic basis spline and a cubic natural spline. To determine the degrees of freedom (df), we use K-fold cross-validation with K = 5. Below is the df for each type of cubic spline.

  • Basis Spline: df = 4 + knots
  • Natural Spline: df = 2 + knots
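We can sanity-check the basis-spline df bookkeeping with patsy (the same package the Python section uses); the knot locations here are hypothetical:

```python
import numpy as np
from patsy import dmatrix

age = np.linspace(18, 80, 200)

# Cubic B-spline basis with 3 interior knots (hypothetical locations)
X = np.asarray(dmatrix("bs(age, knots=(35, 50, 65), degree=3)", {"age": age}))

# Width = 1 intercept column + (degree + n_knots) spline columns
#       = 1 + 3 + 3 = 7, i.e. total df = 4 + knots, as stated above
n_cols = X.shape[1]
```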

Core Analysis

Python

#!/usr/bin/env python
# coding: utf-8
# In[1]:
#Packages required 
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
#%matplotlib inline
import statsmodels.api as sm
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from patsy import dmatrix
# In[2]:
#Let's read in the data file 
data = pd.read_csv("/Users/kwschulz/STATS506/Stats506_Project/Dataset/data.csv")
# In[3]:
#Take a quick glance at what the data looks like
data.head()
# In[4]:
#Let's check to see if we have any missing values
data.isna().sum()
# In[5]:
#Filter to variables for our analysis 
data = data[["wage", "age", "education", "year"]]
# In[6]:
#Map education to ordinal scale
education_map = {"1. < HS Grad":1,"2. HS Grad":2,
                "3. Some College":3, "4. College Grad":4,
                "5. Advanced Degree":5}
data['education'] = data.education.map(education_map)
# In[7]:
#Let's check the distribution of our predictors
data[["wage", "age"]].hist(layout=(2,1), figsize=(15,15))
plt.savefig('hist.png')  #save before show(), which clears the figure
plt.show()
# In[8]:
#checking year distribution
data.year.value_counts().reindex([2003, 2004, 2005, 2006, 2007, 2008, 2009]).plot(kind='bar', 
                                                                                  title='year', 
                                                                                  ylabel='count', 
                                                                                  figsize=(7.5,7.5))
plt.savefig('year_bar.png')
# In[9]:
#checking education distribution 
data.education.value_counts().reindex([1, 2, 3, 4, 5]).plot(kind='bar', 
                                                            title='education', 
                                                            ylabel='count', 
                                                            figsize=(7.5,7.5))
plt.savefig('education_bar.png')
# In[10]:
#linear regression model
model = sm.OLS(data["wage"], sm.add_constant(data.drop('wage',axis=1))).fit()
# In[11]:
#let's check how it did
model.summary()
# In[12]:
#let's cut age into 6 bins - stepwise
data["age_cut"] = pd.cut(data.age, bins=6, labels=False)
# In[13]:
#now let's model age with bin indicators (K-1 dummies; the constant absorbs the first bin)
age_dummies = pd.get_dummies(data["age_cut"], prefix="age_bin", drop_first=True).astype(float)
model2 = sm.OLS(data["wage"], sm.add_constant(pd.concat([data[["education", "year"]], age_dummies], axis=1))).fit()
# In[14]:
#model 2 summary
model2.summary()
# In[15]:
#let's check out the scatter plot of age v wage 
data.plot(x="age", y="wage", kind='scatter', figsize=(7.5,7.5))
# In[16]:
#2nd degree polynomial
p = np.poly1d(np.polyfit(data["age"], data["wage"], 2))
t = np.linspace(0, 80, 200)
plt.plot(data["age"], data["wage"], 'o', t, p(t), '-')
rs = sm.OLS(data["wage"], 
            np.column_stack([data["age"]**i for i in range(3)]) ).fit().rsquared
plt.title('r2 = {}'.format(rs))
plt.savefig('poly2.png')
plt.show()
# In[17]:
#3rd degree polynomial
p = np.poly1d(np.polyfit(data["age"], data["wage"], 3))
t = np.linspace(0, 80, 200)
plt.plot(data["age"], data["wage"], 'o', t, p(t), '-')
rs = sm.OLS(data["wage"], 
            np.column_stack([data["age"]**i for i in range(4)]) ).fit().rsquared
plt.title('r2 = {}'.format(rs))
plt.savefig('poly3.png')
plt.show()
# In[18]:
#4th degree polynomial
p = np.poly1d(np.polyfit(data["age"], data["wage"], 4))
t = np.linspace(0, 80, 200)
plt.plot(data["age"], data["wage"], 'o', t, p(t), '-')
rs = sm.OLS(data["wage"], 
            np.column_stack([data["age"]**i for i in range(5)]) ).fit().rsquared
plt.title('r2 = {}'.format(rs))
plt.savefig('poly4.png')
plt.show()
# In[19]:
#5th degree polynomial
p = np.poly1d(np.polyfit(data["age"], data["wage"], 5))
t = np.linspace(0, 80, 200)
plt.plot(data["age"], data["wage"], 'o', t, p(t), '-')
rs = sm.OLS(data["wage"], 
            np.column_stack([data["age"]**i for i in range(6)]) ).fit().rsquared
plt.title('r2 = {}'.format(rs))
plt.savefig('poly5.png')
plt.show()
# In[20]:
#let's do a third degree polynomial regression
polynomial_features = PolynomialFeatures(degree=3)
age_p = polynomial_features.fit_transform(data['age'].to_numpy().reshape(-1, 1))
#note: age_p already contains a bias column of ones, and sm.add_constant
#(with its default has_constant='skip') detects it and does not add another
model3 = sm.OLS(data["wage"], sm.add_constant(np.concatenate([data[['education', 'year']].to_numpy(), age_p], axis=1))).fit() 
# In[21]:
#check our results
model3.summary(xname=['education', 'year', 'const', 'poly(age, 3)1', 'poly(age, 3)2', 'poly(age, 3)3'])
# In[22]:
#implementing a bspline for age 
age_bs = dmatrix("bs(data.age, df=6)",{"data.age": data.age}, return_type='dataframe')
model4 = sm.OLS(data["wage"], pd.concat([age_bs, data[['education', 'year']]], axis=1)).fit() 
model4.summary()
# In[23]:
#implementing a natural spline for age 
age_ns = dmatrix("cr(data.age, df=6)",{"data.age": data.age}, return_type='dataframe')
model5 = sm.OLS(data["wage"], pd.concat([age_ns, data[['education', 'year']]], axis=1)).fit() 
model5.summary()

Stata

Before starting the spline analysis, we first look at an OLS regression relating wage to age, year, and education. We can run the simple code below to look at this relationship.

reg wage age year educ

Stata will return the following output:

Figure 2.1 OLS output for wage ~ age + year + education

To see whether a non-linear relationship might be present, kernel density, pnorm, and qnorm plots can assist:

predict r, resid
kdensity r, normal

Figure 2.2 Kernel Density Plot

Figure 2.3 Qnorm Plot

After looking at these plots, we might consider what different relationships age may have with wage. We can plot the two-way fit between wage and age, our main variables of interest, to compare basic linear, fractional-polynomial, and quadratic fits.

twoway (scatter wage age) (lfit wage age) (fpfit wage age) (qfit wage age)

Figure 2.4 Fitted Plot - Linear (red), fractional-polynomial (green), and quadratic (yellow) fits

Based on these plots, we might next be interested in fitting a cubic polynomial.

Cubic Polynomial

To create a cubic polynomial in Stata we can use the ## factor-variable operator with the age variable. The regression is written as before, with the addition of the cubic term:

reg wage c.age##c.age##c.age year educ

The output in Stata will look like this:

Figure 2.5 Regression with Cubic polynomial for Age

Piecewise Step Function Regression

For the piecewise step function, the segments and intercepts must be constructed manually in Stata. Based on the analysis in R, we determined that six groups with five cutpoints work best. The code below generates the six age segments and their intercepts.

* generate 6 age variables, one for each bin
* (the age variable does not have decimals)
generate age1 = (age - 28.33)
replace age1 = 0 if age >= 28.33
generate age2 = (age - 38.66)
replace age2 = 0 if age < 28.33 | age >= 38.66
generate age3 = (age - 48.99)
replace age3 = 0 if age < 38.66 | age >= 48.99
generate age4 = (age - 59.33)
replace age4 = 0 if age < 48.99 | age >= 59.33
generate age5 = (age - 69.66)
replace age5 = 0 if age < 59.33 | age >= 69.66
generate age6 = (age - 80)
replace age6 = 0 if age < 69.66
* create intercept variables
generate int1 = 1
replace int1 = 0 if age >= 28.33
generate int2 = 1
replace int2 = 0 if age < 28.33 | age >= 38.66
generate int3 = 1
replace int3 = 0 if age < 38.66 | age >= 48.99
generate int4 = 1
replace int4 = 0 if age < 48.99 | age >= 59.33
generate int5 = 1
replace int5 = 0 if age < 59.33 | age >= 69.66
generate int6 = 1
replace int6 = 0 if age < 69.66

Using these variables we can then compute a step-wise regression.

regress wage int1 int2 int3 int4 int5 int6 age1 age2 age3 age4 age5 age6 ///
  year educ, hascons

Figure 2.6 Step-wise regression for Age with 6 bins

After running the regression we can then use the predicted yhats to graph the results:

predict yhat
twoway (scatter wage age, sort) ///
  (line yhat age if age <28.33, sort) ///
  (line yhat age if age >=28.33 & age < 38.66, sort) ///
  (line yhat age if age >=38.66 & age < 48.99, sort) ///
  (line yhat age if age >=48.99 & age<59.33, sort) ///
  (line yhat age if age >=59.33 & age<69.66, sort) ///
  (line yhat age if age >=69.66, sort), xline(28.33 38.66 48.99 59.33 69.66)

Figure 2.7 Step-wise regression for Age with 6 bins

Basis Spline

For the basis spline, we use the community-contributed command bspline, created by Roger Newson and suggested by Germán Rodríguez at Princeton. To create the spline, we call bspline, setting the x variable to age and identifying where we would like the knots in the function. In this example we use 3 interior knots at 35, 50, and 65; note that the minimum and maximum of age must also be included inside the knots() option. We use a cubic spline, indicated by p(3). The gen() option generates the spline variables for inclusion in the regression. The regression can then be written as below.

bspline, xvar(age) knots(18 35 50 65 80) p(3) gen(_agespt)
regress wage _agespt* year educ, noconstant

Figure 2.8 Basis Spline Regression

To look at the fit for age, we can examine the two-way scatter plot between wage and age using the predicted values of the bivariate regression with splines.

regress wage _agespt*, noconstant
predict agespt
*(option xb assumed; fitted values)
twoway (scatter wage age)(line agespt age, sort), legend(off)  ///
  title(Basis Spline for Age)

Figure 2.9 Basis spline fit for Age

Natural Spline

This further extension is still being coded. Please see the README.md file.

R

Here are the required libraries for this tutorial.

library(tidyverse) ## This library is for data manipulation.
library(ggplot2) ## This library is for data visualization.
library(gridExtra) ## This library is also for data visualization.
library(splines) ## This library is for splines.

In this example, we wrote two additional functions. All of the code for these user-written functions is in User-written function.R.

  • wage_age plots the scatter plot between age and wage, including the regression line.
  • plot_kfold performs the 5-fold cross-validation used to select the degrees of freedom for the basis and natural splines.

For simplicity, we convert the Education variable into a numerical variable.

Table 3.1 The converted values of the Education variable.

Old Value            New Value
1. < HS Grad         1
2. HS Grad           2
3. Some College      3
4. College Grad      4
5. Advanced Degree   5

data <- data %>% mutate(edu = as.numeric(substring(data$education, 1, 1)))

Linear Regression

First, consider the linear regression. Below is the model, along with the Quantile-Quantile plot of its residuals.

model_lr <- lm(wage ~ age + edu + year, data = data)
summary(model_lr)
                   wage
Predictors           Estimates   CI                    p
(Intercept)          -2142.85    -3420.48 – -865.21    0.001
age                      0.58    0.47 – 0.69           <0.001
edu                     15.92    14.85 – 16.98         <0.001
year                     1.09    0.45 – 1.72           0.001
Observations         3000
R2 / R2 adjusted     0.256 / 0.255


Table 3.2 Regression Model when using age directly.

According to Table 3.2, the fitted model is \(Wage = -2142.85 + 0.58(Age) + 15.92(Edu) + 1.09(Year)\). The \(R^2\) for this model is 0.2555, which is quite low. Hence, we look at the Quantile-Quantile plot of the residuals, which checks one of linear regression’s assumptions, to see whether the model violates it.

Figure 3.3 The QQ plot, along with the kernel density plot of the residuals from the linear regression.

Figure 3.4 Scatter plot between Wage and Age

According to Figure 3.3 (left), the residuals are not normally distributed, since some data points do not lie on the line.

Apart from Figure 3.3, Figure 3.4 also shows that the relationship between Wage and Age is not linear; therefore, this tutorial tries different types of relationships between Wage and Age.

Cubic Polynomial Regression

The first alternative we consider is polynomial regression. To fit it, use the poly() function inside lm().

model_poly <- lm(wage ~ poly(age, 3) + edu + year, data = data)
summary(model_poly)
                   wage
Predictors           Estimates   CI                     p
(Intercept)          -2330.05    -3583.92 – -1076.18    <0.001
age [1st degree]       369.89    300.40 – 439.37        <0.001
age [2nd degree]      -383.49    -453.11 – -313.88      <0.001
age [3rd degree]        80.58    11.22 – 149.95         0.023
edu                     15.30    14.25 – 16.35          <0.001
year                     1.19    0.57 – 1.82            <0.001
Observations         3000
R2 / R2 adjusted     0.285 / 0.283


Table 3.5 Cubic Polynomial Regression Model

Figure 3.6 Scatter plot between Wage and Age with a line indicating the cubic polynomial regression

According to Table 3.5, the model is \(Wage = -2330.05 + 369.89\,p_{1}(Age) - 383.49\,p_{2}(Age) + 80.58\,p_{3}(Age) + 15.30(Edu) + 1.19(Year)\), where \(p_{1}, p_{2}, p_{3}\) are the orthogonal polynomial terms produced by poly(); the coefficients do not apply directly to raw powers of Age. The corresponding \(R^2\) is 0.2846.

Piecewise Linear Regression

model_cut <- lm(wage ~ cut(age, 6) + edu + year, data = data)
summary(model_cut)
                          wage
Predictors                  Estimates   CI                     p
(Intercept)                 -2427.01    -3689.51 – -1164.51    <0.001
cut(age, 6)(28.3,38.7]         18.31    13.92 – 22.71          <0.001
cut(age, 6)(38.7,49]           27.67    23.50 – 31.84          <0.001
cut(age, 6)(49,59.3]           25.24    20.73 – 29.74          <0.001
cut(age, 6)(59.3,69.7]         25.29    18.92 – 31.65          <0.001
cut(age, 6)(69.7,80.1]          8.55    -3.76 – 20.85          0.173
edu                            15.51    14.45 – 16.56          <0.001
year                            1.23    0.60 – 1.86            <0.001
Observations                3000
R2 / R2 adjusted            0.276 / 0.274


Table 3.7 Piecewise-Linear Regression Model

Figure 3.7 Scatter plot between Wage and Age with the step function.

According to Table 3.7, the model is \(Wage = -2427.01 + 18.31\,I_{(28.3,38.7]} + 27.67\,I_{(38.7,49]} + 25.24\,I_{(49,59.3]} + 25.29\,I_{(59.3,69.7]} + 8.55\,I_{(69.7,80.1]} + 15.51(Edu) + 1.23(Year)\), where \(I_{(a,b]}\) is an indicator function of Age. The corresponding \(R^2\) is 0.2758.

Basis Spline

In R, we will use bs() for the basis spline. We must specify either the knots or the degrees of freedom; in this tutorial, we specify the degrees of freedom instead of the knots.

Consider the MSE for the basis spline.

Figure 3.8 The MSE of the basis spline from performing 5-fold cross validation

Figure 3.8 shows that the MSE is lowest when the degrees of freedom equal 6. We therefore fit the regression using a basis spline with df = 6.

model_basis <- lm(wage ~ bs(age, df = 6) + edu + year, data = data)
summary(model_basis)
                   wage
Predictors           Estimates   CI                     p
(Intercept)          -2433.62    -3688.26 – -1178.97    <0.001
age [1st degree]         8.65    -12.96 – 30.26         0.433
age [2nd degree]        31.02    18.56 – 43.48          <0.001
age [3rd degree]        46.08    31.58 – 60.58          <0.001
age [4th degree]        32.39    17.17 – 47.62          <0.001
age [5th degree]        49.67    25.79 – 73.56          <0.001
age [6th degree]         5.50    -22.62 – 33.62         0.701
edu                     15.33    14.28 – 16.38          <0.001
year                     1.23    0.60 – 1.85            <0.001
Observations         3000
R2 / R2 adjusted     0.286 / 0.284


Table 3.9 Linear Regression Model with cubic basis spline on Age variable.

Figure 3.10 Scatter plot between Wage and Age with regression line (basis spline)

Natural Spline

Similar to the basis spline, we will use ns() in R, again specifying either the knots or the degrees of freedom; as before, we specify the degrees of freedom instead of the knots.

Figure 3.11 The MSE of the natural spline from performing 5-fold cross validation

Figure 3.11 shows that the MSE is lowest when the degrees of freedom equal 6. We therefore fit the regression using a natural spline with df = 6.

model_natural <- lm(wage ~ ns(age, df = 6) + edu + year, data = data)
summary(model_natural)
                   wage
Predictors           Estimates   CI                     p
(Intercept)          -2487.12    -3742.29 – -1231.95    <0.001
age [1st degree]        39.03    29.88 – 48.17          <0.001
age [2nd degree]        46.81    35.21 – 58.41          <0.001
age [3rd degree]        37.81    27.74 – 47.89          <0.001
age [4th degree]        38.04    28.59 – 47.49          <0.001
age [5th degree]        48.97    26.02 – 71.92          <0.001
age [6th degree]         5.60    -11.95 – 23.14         0.532
edu                     15.33    14.28 – 16.38          <0.001
year                     1.25    0.63 – 1.88            <0.001
Observations         3000
R2 / R2 adjusted     0.287 / 0.285


Table 3.12 Linear Regression Model with cubic natural spline on Age variable.

Figure 3.13 Scatter plot between Wage and Age with regression line (natural spline)

Summary

The table below shows the \(R^2\) for each model.

Model                        Python   Stata    R
Linear Regression            xx       0.2555   0.2555
Cubic Polynomial Regression  xx       0.2846   0.2846
Piecewise Linear Regression  xx       0.2882   0.2758
Basis Spline                 xx       0.9126   0.2864
Natural Spline               xx       xx       0.2869

Table 4.1 The coefficient of determination (\(R^2\)) for each model in Python, Stata, and R. Note that the Stata basis-spline model was fit with the noconstant option, so its reported \(R^2\) is computed about zero rather than the mean and is not directly comparable to the others.

Discussion

Reference